White Wine - Exploratory Data Analysis by Elizabeth Wanjiku

Short Description

This report explores a data set containing 4,898 white wines with 11 variables that quantify the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

Univariate Plots Section

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Ord.factor w/ 7 levels "3"<"4"<"5"<"6"<..: 4 4 4 4 4 4 4 4 4 4 ...

Above we see the variables and their assigned data types in our data set.

We converted the variable quality into an ordered categorical (ordinal) variable, since we will not perform any calculations on it but will use it to rate the wines on a scale from 0 (very bad) to 10 (very excellent). Only ratings 3 to 9 occur in the data, hence the 7 levels shown above.

##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Variable quality

We started by exploring the quality variable. Its distribution is roughly symmetric and bell-shaped, and the most common rating is 6 (above average), given to 2,198 of the 4,898 wines.
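The conversion to an ordered factor and the frequency table can be sketched as follows. This is a minimal illustration on made-up ratings, not the report's actual code; the vector `ratings` is hypothetical. With the real counts above, rating 6 accounts for 2198 / 4898, or about 45%, of the wines.

```r
# Toy ratings standing in for the real quality column (values are made up).
ratings <- c(6, 5, 7, 6, 3, 9, 6, 5, 8, 4)

# Declare every level that occurs in the data set (3 to 9) so that
# rarely used ratings still appear in the frequency table.
quality <- factor(ratings, levels = 3:9, ordered = TRUE)

counts <- table(quality)
print(counts)

# The most common rating and its share of all wines.
top_rating <- names(counts)[which.max(counts)]
top_share  <- max(counts) / sum(counts)
```

Declaring the levels explicitly keeps the ordering 3 < 4 < ... < 9 even if a subset of the data happens to miss one of the ratings.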

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

Variable alcohol

Alcohol content is right-skewed (the mean of 10.51 lies above the median of 10.40), with the most common value at about 9.5, and ranges from 8.00 to 14.20.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4100  0.4700  0.4898  0.5500  1.0800
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Variable sulphates

Sulphate content is roughly bell-shaped with a slight right skew (the mean of 0.4898 lies above the median of 0.47). The most common amount is about 0.45, values range from 0.22 to 1.08, and half of the wines have a sulphate content of 0.47 or less.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

Variable pH

The wine pH ranges from 2.72 to 3.82 and is approximately normally distributed, with the most common pH read from the plot at about 3.0. The median is 3.18 and the third quartile is 3.28, so a quarter of the wines have a pH between 3.18 and 3.28.
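To make the reading of the summary output concrete, here is a small sketch of how the quartiles relate to shares of the wines. The data are simulated (an assumption, not the real pH column); the point is only that the interval from the median to the third quartile holds about 25% of the observations, while the interval from the first to the third quartile holds about 50%.

```r
set.seed(1)
# Simulated pH values; the mean and sd are illustrative assumptions.
ph <- rnorm(1000, mean = 3.19, sd = 0.15)

# First quartile, median and third quartile, as reported by summary().
q <- quantile(ph, probs = c(0.25, 0.50, 0.75))

# Share of wines between the median and the third quartile (about 0.25).
between_median_q3 <- mean(ph > q[2] & ph <= q[3])

# Share of wines between the first and third quartiles (about 0.50).
middle_half <- mean(ph > q[1] & ph <= q[3])
```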

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##      99% 
## 1.000302

Variable density

On average the wine has a density of 0.994. Density is approximately normally distributed apart from a few high outliers (the maximum is 1.039). The median is 0.9937 and the third quartile is 0.9961, so a quarter of the wines have a density between these two values.

We then zoomed into the bulk of the density distribution by omitting the top 1% of values (those above 1.0003) to get a better view of the plot. Most wines have a density around 0.993.

I think it would be important to learn how the amount of alcohol per unit density affects the preference of the critics. I will therefore create a new variable: alcohol.density = alcohol / density.
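Omitting the top 1% can be sketched with `quantile()`. This is one possible approach on simulated values, not necessarily how the report's plots were produced (they may simply have limited the x-axis); the vector `density` below is made up.

```r
set.seed(42)
# Simulated densities: a bell-shaped bulk plus ten extreme values.
density <- c(rnorm(990, mean = 0.994, sd = 0.003),
             runif(10, min = 1.01, max = 1.04))

# The 99th percentile serves as the cut-off for the "bulk".
cutoff <- quantile(density, probs = 0.99)

# Keep only the observations at or below the cut-off.
bulk <- density[density <= cutoff]

# hist(bulk) would now show the bulk without the long tail.
```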
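A minimal sketch of the derived variable, using the first four alcohol and density values shown in the structure output at the top (density as displayed, i.e. rounded); the data frame name `wines` is an assumption. Because density only varies between about 0.987 and 1.039, the ratio is almost proportional to alcohol itself, which fits the correlation of 1.00 between alcohol and alcohol.density reported later in the correlation matrix.

```r
# First four wines from the str() output (density rounded as displayed).
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9),
  density = c(1.001, 0.994, 0.995, 0.996)
)

# Alcohol content per unit density.
wines$alcohol.density <- wines$alcohol / wines$density

# Correlation between the original and the derived variable.
r <- cor(wines$alcohol, wines$alcohol.density)
print(wines)
```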

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.025   9.517  10.445  10.580  11.486  14.376
##     99% 
## 13.5418

Variable alcohol.density

On average the wine has an alcohol.density value of 10.58. The distribution is roughly bell-shaped with a right skew, much like alcohol. The median is 10.445 and the third quartile is 11.486, so a quarter of the wines have values between 10.445 and 11.486; the maximum is 14.376.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

##    99% 
## 241.03

Variable total.sulfur.dioxide

Total sulfur dioxide is approximately normally distributed with a long right tail (the maximum is 440.0). The median is 134.0 and the third quartile is 167.0, so a quarter of the wines have a level between 134.0 and 167.0.

We then zoomed into the bulk of the total sulfur dioxide distribution by omitting the top 1% of its values (above 241.03) to get a better view of the plot. The most common total sulfur dioxide level is about 120.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   23.00   34.00   35.31   46.00  289.00

## 99% 
##  81

Variable free.sulfur.dioxide

Free sulfur dioxide is approximately normally distributed with a long right tail (the maximum is 289.0). The middle 50% of the wines (first to third quartile) have levels between 23.0 and 46.0.

We then zoomed into the bulk of the free sulfur dioxide distribution by omitting the top 1% of its values (above 81) to get a better view of the plot. The most common free sulfur dioxide level is about 32.5.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

##   95% 
## 0.067

Variable chlorides

The median chloride level is 0.043 and the third quartile is 0.05, so a quarter of the wines fall between 0.043 and 0.05. The distribution is roughly bell-shaped in the bulk but has a long right tail (the maximum is 0.346). We plotted the bulk of the distribution by cutting the top 5% of its values (above 0.067), which showed that the most common chloride level is about 0.0475.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.600   1.700   5.200   6.391   9.900  65.800

Variable residual.sugar

On average the residual sugar level is 6.391. The median is 5.2 and the third quartile is 9.9, so a quarter of the wines have residual sugar between 5.2 and 9.9. The maximum recorded is 65.8.

We then omitted the top 1% of its values and saw a right-skewed distribution with a long tail (the mean of 6.391 lies well above the median of 5.2). To get a better look at the distribution without being distracted by its tail, we used a log transformation and observed a bimodal distribution, with maximum counts at around 1.5 and around 9.75.
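The skew check and the log transformation can be sketched like this. The sugar values are simulated from two log-normal groups (an assumption chosen to mimic a bimodal, right-skewed variable), and `log10` is used because the model formulas later in the report use `log10(residual.sugar)`.

```r
set.seed(7)
# Simulated residual sugar (g/L): a low-sugar and a high-sugar group.
sugar <- c(rlnorm(600, meanlog = log(1.5), sdlog = 0.3),
           rlnorm(400, meanlog = log(10),  sdlog = 0.4))

# In a right-skewed variable the long upper tail pulls the mean
# above the median.
skew_gap <- mean(sugar) - median(sugar)

# The log transformation compresses the upper tail. All values must be
# positive; the real minimum of 0.6 satisfies this.
sugar_log <- log10(sugar)

# Bin the transformed values without drawing, to inspect the counts.
bins <- hist(sugar_log, breaks = 30, plot = FALSE)
```

On the transformed scale the two groups show up as separate peaks in `bins$counts`, which is the kind of bimodal shape described above.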

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

##  99% 
## 0.74

Variable citric.acid

Some wines have a citric acid level of 0, and the maximum is 1.66. The median is 0.32 and the third quartile is 0.39, so a quarter of the wines have citric acid levels between 0.32 and 0.39. Citric acid is approximately normally distributed with a long right tail.

We then zoomed into the bulk of the citric acid distribution by omitting the top 1% of its values (above 0.74) and saw that the most common level is about 0.325.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

##  99% 
## 0.63

Variable volatile.acidity

Volatile acidity ranges from 0.08 to 1.1. The median is 0.26 and the third quartile is 0.32, so a quarter of the wines fall between 0.26 and 0.32. Volatile acidity is roughly bell-shaped with a right skew (the mean of 0.2782 lies above the median), and the most common level is about 0.275.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

Variable fixed.acidity

Fixed acidity ranges from 3.8 to 14.2. The median is 6.8 and the third quartile is 7.3, so a quarter of the wines have levels between 6.8 and 7.3. To get a better view of the bulk of the distribution we cut off the top 1% of its values and saw that the most common level is about 6.875. Fixed acidity is approximately normally distributed.

Univariate Analysis

What is the structure of your dataset?

There are 4,898 white wines in the dataset with 11 chemical features (fixed and volatile acidity, free and total sulfur dioxide, citric acid, residual sugar, chloride and sulphate levels, pH, density and alcohol content) plus the quality rating.

The quality rating is an ordered factor. The rating scale runs from 0 (very bad) to 10 (very excellent), but only levels 3 to 9 occur in the data.

Other observations :

  • the most common quality rating was 6 (above average)
  • Alcohol content ranges from 8.00 to 14.20
  • on average wine has alcohol content per unit density of 10.58
  • on average wine sulphates levels are at 0.4898
  • a quarter of the wines had a pH between 3.18 (median) and 3.28 (third quartile)
  • a quarter of the wines had chloride levels between 0.043 (median) and 0.05 (third quartile)
  • on average wine density level is at 0.994
  • on average free and total sulfur dioxide levels are 35.31 and 138.4 (respectively)
  • on average volatile and fixed acidity levels are 0.2782 and 6.855 (respectively)
  • residual sugar levels were skewed to the right with a long tail
  • the most common citric acid level was about 0.325

PS : variable X represents the index of the observations and was not used during the Exploratory Data Analysis Process

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is the quality rating. I would like to determine which features are best for predicting the quality rating of white wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Alcohol content, acidity (fixed and volatile), residual sugar, total sulfur dioxide and pH are likely to contribute to the quality rating.

Did you create any new variables from existing variables in the dataset?

I created a new variable alcohol.density = alcohol / density, since I think the alcohol content per unit density could affect the rating awarded by a critic. I study how in the Bivariate Analysis section.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I found that residual sugar was right-skewed with a long tail, so I used a log transformation on it. The transformed distribution was bimodal, with maximum counts at around 1.5 and around 9.75.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity total.sulfur.dioxide
## fixed.acidity                 1.00            -0.02                 0.09
## volatile.acidity             -0.02             1.00                 0.09
## total.sulfur.dioxide          0.09             0.09                 1.00
## pH                           -0.43            -0.03                 0.00
## alcohol                      -0.12             0.07                -0.45
## alcohol.density              -0.13             0.07                -0.45
## residual.sugar.log            0.07             0.09                 0.42
##                         pH alcohol alcohol.density residual.sugar.log
## fixed.acidity        -0.43   -0.12           -0.13               0.07
## volatile.acidity     -0.03    0.07            0.07               0.09
## total.sulfur.dioxide  0.00   -0.45           -0.45               0.42
## pH                    1.00    0.12            0.12              -0.18
## alcohol               0.12    1.00            1.00              -0.39
## alcohol.density       0.12    1.00            1.00              -0.40
## residual.sugar.log   -0.18   -0.39           -0.40               1.00

I excluded all variables that are not currently being explored. Quality is also left out of this correlation matrix because it is an ordered factor rather than a numeric variable.

I noted that the following pairs are moderately correlated with each other:

  • Fixed Acidity and pH (r = -0.43)
  • Residual Sugar (log) and Total Sulfur Dioxide (r = 0.42)
  • Residual Sugar (log) and Alcohol (r = -0.39)
  • Total Sulfur Dioxide and Alcohol (r = -0.45)

Then we added quality to the plot matrix. What stood out most was its relationship with alcohol (r = 0.44), alcohol.density (r = 0.43), volatile acidity (r = -0.19) and total sulfur dioxide (r = -0.17).

Now I want to look closer at plots involving alcohol, volatile acidity, fixed acidity, residual sugar, total sulfur dioxide and pH.
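A correlation matrix like the one above can be produced with `cor()` and `round()`. The sketch below uses simulated columns, so the numbers will not match the report; the 0.3 threshold for flagging pairs is an arbitrary assumption.

```r
set.seed(3)
n <- 200
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)

# Simulated numeric variables; residual.sugar.log is built to be
# negatively related to alcohol, pH is independent of both.
wines <- data.frame(
  alcohol            = alcohol,
  residual.sugar.log = -0.3 * alcohol + rnorm(n),
  pH                 = rnorm(n, mean = 3.19, sd = 0.15)
)

# Pearson correlations, rounded to two decimals as in the report.
cor_matrix <- round(cor(wines, method = "pearson"), 2)
print(cor_matrix)

# Row and column positions of pairs with |r| >= 0.3, each pair once.
strong <- which(abs(cor_matrix) >= 0.3 & upper.tri(cor_matrix),
                arr.ind = TRUE)
```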

We started by studying the relationship between quality and alcohol. Because of overplotting, we added some transparency and jittered the points, which adds a little noise to the alcohol and quality data so that overlapping points become visible.

Overall it seems that critics are more likely to give a better rating when the wine has a higher alcohol content. This is similar to what is described by Waterhouse Lab (UC Davis): wine with higher alcohols can have an aromatic effect.

This does not mean, though, that alcohol content is the only feature that contributes to a better quality rating.
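The transparency and jitter described here might look as follows. The `stat_bin()` messages earlier suggest the plots were drawn with ggplot2, but the actual plotting calls are not shown in the report, so the code, the simulated data and the chosen `alpha` and `width` values are all assumptions.

```r
set.seed(11)
# Simulated stand-in for the wine data.
wines <- data.frame(
  quality = factor(sample(3:9, 300, replace = TRUE),
                   levels = 3:9, ordered = TRUE),
  alcohol = round(runif(300, min = 8, max = 14.2), 1)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # alpha makes points semi-transparent so dense regions look darker;
  # width spreads points horizontally within each rating, and
  # height = 0 leaves the measured alcohol values untouched.
  p <- ggplot(wines, aes(x = quality, y = alcohol)) +
    geom_jitter(alpha = 0.25, width = 0.3, height = 0) +
    labs(x = "Quality rating", y = "Alcohol")
  # print(p) draws the plot.
}
```

Jittering only along the rating axis is a design choice: the ratings are discrete, so horizontal noise separates overlapping points without distorting the alcohol values.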

## 
## Call:
## lm(formula = as.numeric(quality) ~ alcohol, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5317 -0.5286  0.0012  0.4996  3.1579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.582009   0.098008   5.938 3.08e-09 ***
## alcohol     0.313469   0.009258  33.858  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7973 on 4896 degrees of freedom
## Multiple R-squared:  0.1897, Adjusted R-squared:  0.1896 
## F-statistic:  1146 on 1 and 4896 DF,  p-value: < 2.2e-16

The output above shows that alcohol explains about 19% of the variance in the quality rating (R-squared = 0.1897), implying that other variables also contribute to the variation in quality ratings.
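One detail of the model call is worth illustrating: `as.numeric()` applied to a factor returns the internal level codes (1 to 7 here), not the original scores (3 to 9). The sketch below, on simulated data, shows that this shifts only the intercept; the slope and R-squared are unchanged. Since all seven levels from 3 to 9 are present in the report's data, its intercept of 0.582 is on the code scale and would be about 2.582 on the original rating scale.

```r
set.seed(5)
n <- 300
# Simulated alcohol and a rating loosely driven by it (assumption).
alcohol <- runif(n, min = 8, max = 14)
score   <- pmin(9, pmax(3, round(0.5 * alcohol + rnorm(n, sd = 0.8))))
quality <- factor(score, levels = 3:9, ordered = TRUE)

# Level codes (1..7) versus original scores (3..9).
codes  <- as.numeric(quality)
scores <- as.numeric(as.character(quality))

fit_codes  <- lm(codes ~ alcohol)
fit_scores <- lm(scores ~ alcohol)

# Intercepts differ by exactly 2; the slopes are identical.
coef_diff <- coef(fit_scores) - coef(fit_codes)

# R-squared is the share of variance explained and is the same for both.
r2_codes  <- summary(fit_codes)$r.squared
r2_scores <- summary(fit_scores)$r.squared
```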

## # A tibble: 7 x 2
##   quality mean_alcohol.price
##   <ord>                <dbl>
## 1 3                    10.4 
## 2 4                    10.2 
## 3 5                     9.86
## 4 6                    10.6 
## 5 7                    11.5 
## 6 8                    11.7 
## 7 9                    12.3

Similar to what we saw earlier, better ratings were given to wines with higher alcohol content per unit density: from rating 5 upwards the mean rises steadily from 9.86 to 12.3. The lowest ratings (3 and 4) are an exception, with means of 10.4 and 10.2.

Then we proceeded to study the relationship between quality and volatile acidity. Because of overplotting, we added some transparency and jittered the points, adding a little noise to the volatile acidity and quality data.

Waterhouse Lab (UC Davis) describes volatile acidity as a marker of wine spoilage and undesirable aromas, so a lower concentration should improve the wine’s quality.

In line with my research, we can see that at lower volatile acidity levels (around 0.15 to 0.35) critics were more likely to give a better rating of 6 (above average).

Then we proceeded to study the relationship between quality and pH. Because of overplotting, we added some transparency and jittered the points, adding a little noise to the pH and quality data.

We see that wine with a mid-range pH (i.e. 3.0 to 3.35, neither too low nor too high) was more likely to be given a good rating (i.e. 5, average, or 6, above average).

Then we proceeded to study the relationship between quality and residual sugar. Because of overplotting, we added some transparency and jittered the points, adding a little noise to the residual sugar and quality data. We also log-transformed residual sugar since, as we found in the Univariate Plots Section, it does not have a normal distribution.

We can see that dry wine and semi-sweet wine, i.e. wine with residual sugar content of 1 to 2 g/L and 11 to 30 g/L respectively, are more likely to be given a better rating than off-dry wine, i.e. wine with residual sugar content of 5 to 10 g/L.

We see that wines with lower total sulfur dioxide levels are awarded a better quality rating.
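A per-rating mean table of this kind can be computed in base R with `aggregate()`. The tibble above suggests the report used a different (dplyr-style) pipeline, so this is only an equivalent sketch on made-up values.

```r
# Made-up example rows; only ratings 5, 6 and 7 occur here.
wines <- data.frame(
  quality = factor(c(5, 5, 6, 6, 6, 7), levels = 3:9, ordered = TRUE),
  alcohol.density = c(9.5, 10.1, 10.4, 10.8, 11.0, 11.6)
)

# Mean alcohol.density for each quality rating that occurs in the data.
mean_by_quality <- aggregate(alcohol.density ~ quality,
                             data = wines, FUN = mean)
print(mean_by_quality)
```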

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Quality correlates moderately with alcohol content: the higher the alcohol content, the more likely the critic is to give a better rating. Based on the R-squared value from the linear model fit, however, alcohol only explains around 19% of the variance in quality. Other features of interest should be incorporated into the model to explain the variance in quality ratings.

Better ratings were given when the volatile acidity and Total Sulphur Dioxide levels were low

Wines with a moderate pH (i.e. not too high or too low) were more likely to get a better quality rating

The critics preferred wine that was either dry or semi-sweet as opposed to off-dry wine

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The wine’s residual sugar content is moderately negatively correlated with the alcohol level (r = -0.39): the higher the alcohol level, the lower the sugar content (and vice versa). This is expected, as residual sugar is the sugar remaining after fermentation stops or is stopped.

What was the strongest relationship you found?

Among the relationships with quality, the strongest was with alcohol level, a moderate positive correlation (r = 0.44).

Multivariate Plots Section

The scatter plots above show the trends that were noted in the Bivariate Plots and Analysis sections. Better quality ratings were awarded when there was higher alcohol content, higher pH, lower fixed acidity, lower volatile acidity and lower total sulfur dioxide.

We decided to facet the relationship between fixed acidity and pH by quality rating to better show that better ratings are given when pH and acidity are not too high or too low.

The above plots suggest that we can build a model using the above variables to predict the quality rating a critic gives. Since quality is an ordered factor, the models below are ordinal logistic regressions (polr) rather than ordinary linear models.

## 
## Re-fitting to get Hessian
## 
## Calls:
## m1: polr(formula = quality ~ alcohol.density, data = data)
## m2: polr(formula = quality ~ alcohol.density + volatile.acidity, 
##     data = data)
## m3: polr(formula = quality ~ alcohol.density + volatile.acidity + 
##     log10(residual.sugar), data = data)
## m4: polr(formula = quality ~ alcohol.density + volatile.acidity + 
##     log10(residual.sugar) + free.sulfur.dioxide, data = data)
## m5: polr(formula = quality ~ alcohol.density + volatile.acidity + 
##     log10(residual.sugar) + free.sulfur.dioxide + fixed.acidity, 
##     data = data)
## m6: polr(formula = quality ~ alcohol.density + volatile.acidity + 
##     log10(residual.sugar) + free.sulfur.dioxide + fixed.acidity + 
##     total.sulfur.dioxide, data = data)
## m7: polr(formula = quality ~ alcohol.density + volatile.acidity + 
##     log10(residual.sugar) + free.sulfur.dioxide + fixed.acidity + 
##     total.sulfur.dioxide + pH, data = data)
## 
## ===========================================================================================================================
##                               m1            m2            m3            m4            m5            m6            m7       
## ---------------------------------------------------------------------------------------------------------------------------
##   alcohol.density            0.762***      0.808***      0.945***      0.970***      0.959***      0.941***      0.937***  
##                             (0.024)       (0.025)       (0.027)       (0.028)       (0.028)       (0.029)       (0.029)    
##   3|4                        2.194***      1.029**       2.797***      3.402***      2.137***      1.873***      4.508***  
##                             (0.327)       (0.335)       (0.367)       (0.381)       (0.461)       (0.474)       (0.891)    
##   4|5                        4.447***      3.328***      5.130***      5.741***      4.486***      4.220***      6.850***  
##                             (0.250)       (0.259)       (0.300)       (0.318)       (0.408)       (0.423)       (0.864)    
##   5|6                        7.202***      6.254***      8.126***      8.751***      7.503***      7.235***      9.868***  
##                             (0.248)       (0.255)       (0.301)       (0.319)       (0.408)       (0.423)       (0.865)    
##   6|7                        9.593***      8.742***     10.658***     11.294***     10.053***      9.790***     12.429***  
##                             (0.269)       (0.274)       (0.320)       (0.339)       (0.422)       (0.436)       (0.874)    
##   7|8                       11.784***     10.928***     12.873***     13.521***     12.289***     12.026***     14.668***  
##                             (0.287)       (0.291)       (0.337)       (0.356)       (0.434)       (0.448)       (0.881)    
##   8|9                       15.446***     14.587***     16.542***     17.194***     15.966***     15.703***     18.345***  
##                             (0.527)       (0.529)       (0.556)       (0.568)       (0.620)       (0.630)       (0.986)    
##   volatile.acidity                        -5.093***     -5.695***     -5.517***     -5.555***     -5.410***     -5.345***  
##                                           (0.288)       (0.294)       (0.295)       (0.295)       (0.302)       (0.303)    
##   log10(residual.sugar)                                  0.940***      0.817***      0.834***      0.861***      0.904***  
##                                                         (0.077)       (0.080)       (0.080)       (0.081)       (0.082)    
##   free.sulfur.dioxide                                                  0.011***      0.010***      0.013***      0.014***  
##                                                                       (0.002)       (0.002)       (0.002)       (0.002)    
##   fixed.acidity                                                                     -0.162***     -0.153***     -0.099**   
##                                                                                     (0.033)       (0.033)       (0.037)    
##   total.sulfur.dioxide                                                                            -0.002*       -0.003**   
##                                                                                                   (0.001)       (0.001)    
##   pH                                                                                                             0.727***  
##                                                                                                                 (0.208)    
## ---------------------------------------------------------------------------------------------------------------------------
##   Aldrich-Nelson R-sq.       0.186         0.227         0.245         0.250         0.252         0.253         0.254     
##   McFadden R-sq.             0.089         0.114         0.126         0.129         0.131         0.131         0.132     
##   Cox-Snell R-sq.            0.205         0.255         0.278         0.283         0.286         0.287         0.289     
##   Nagelkerke R-sq.           0.221         0.276         0.300         0.306         0.310         0.311         0.313     
##   Likelihood-ratio        1121.493      1441.589      1592.852      1629.259      1653.123      1658.516      1670.772     
##   p                          0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood         -5759.841     -5599.792     -5524.161     -5505.958     -5494.025     -5491.329     -5485.201     
##   Deviance               11519.682     11199.585     11048.322     11011.916     10988.051     10982.658     10970.402     
##   AIC                    11533.682     11215.585     11066.322     11031.916     11010.051     11006.658     10996.402     
##   BIC                    11579.158     11267.558     11124.791     11096.881     11081.513     11084.617     11080.858     
##   N                       4898          4898          4898          4898          4898          4898          4898         
## ===========================================================================================================================

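After adding all the variables under study, model m7 reaches an Aldrich-Nelson pseudo-R-squared of 0.254 (McFadden: 0.132). These are likelihood-based pseudo-R-squared measures, so the 25.4% is only loosely comparable to a share of variance explained in the wine quality rating.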
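As a check on how the fit statistics in the table relate to each other, the m7 column can be reproduced from its log-likelihood, likelihood-ratio statistic and N. The formulas below are the usual textbook definitions; which exact definitions the table's software uses is not stated in the report, so that is an assumption, but the figures agree to three decimals.

```r
# Figures read from the m7 column of the table above.
loglik_model <- -5485.201
lr_stat      <- 1670.772
n_obs        <- 4898
n_params     <- 7 + 6   # 7 slope coefficients plus 6 cut-points

# The likelihood-ratio statistic is 2 * (logLik model - logLik null).
loglik_null <- loglik_model - lr_stat / 2

mcfadden       <- 1 - loglik_model / loglik_null
aldrich_nelson <- lr_stat / (lr_stat + n_obs)
cox_snell      <- 1 - exp(-lr_stat / n_obs)
nagelkerke     <- cox_snell / (1 - exp(2 * loglik_null / n_obs))

# Information criteria from the deviance (-2 * logLik).
aic <- -2 * loglik_model + 2 * n_params
bic <- -2 * loglik_model + log(n_obs) * n_params

print(round(c(AldrichNelson = aldrich_nelson, McFadden = mcfadden,
              CoxSnell = cox_snell, Nagelkerke = nagelkerke), 3))
```

The jump in BIC from m5 to m6 in the table, despite a lower AIC, reflects the heavier penalty `log(n_obs)` per parameter compared with AIC's penalty of 2.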

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Wines with a higher amount of alcohol and lower total sulfur dioxide were more likely to get better quality ratings.

Better ratings were given when there were trace amounts of free sulfur dioxide and lower levels of volatile acidity.

From the Multivariate section I can now build an ordinal logistic model and use those variables to predict the wine critic’s quality rating.
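A minimal sketch of fitting and using such an ordinal logistic model with `polr()` from the MASS package (the function named in the model calls above). The data are simulated with a single predictor, so the coefficients are not comparable to the report's table. Passing `Hess = TRUE` stores the Hessian at fit time, which should avoid the "Re-fitting to get Hessian" message that `summary()` prints otherwise.

```r
set.seed(2024)
n <- 500
# Simulated data: a latent score driven by alcohol, cut into 4 ratings.
alcohol <- rnorm(n, mean = 10.5, sd = 1.2)
latent  <- 0.9 * (alcohol - 10.5) + rlogis(n)
quality <- cut(latent, breaks = c(-Inf, -1.5, 0, 1.5, Inf),
               labels = c("4", "5", "6", "7"), ordered_result = TRUE)
wines <- data.frame(quality, alcohol)

if (requireNamespace("MASS", quietly = TRUE)) {
  # Proportional-odds logistic regression for an ordered response.
  fit <- MASS::polr(quality ~ alcohol, data = wines, Hess = TRUE)

  # Predicted class probabilities and most likely rating for two
  # hypothetical wines with 9% and 12% alcohol.
  new_wines <- data.frame(alcohol = c(9, 12))
  probs     <- predict(fit, newdata = new_wines, type = "probs")
  predicted <- predict(fit, newdata = new_wines, type = "class")
}
```

Each row of `probs` sums to 1 across the rating levels, so the model yields a full distribution over ratings rather than a single number.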

Were there any interesting or surprising interactions between features?

From research I saw that pH and acidity also affect the quality of wine: with too little acidity the wine can be described as flat and unappealing, while with too much the wine is so tart that it would not be pleasing.

In line with the above, most critics were more likely to give a good rating to wine that was neither too high nor too low on the pH and fixed acidity spectrum.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

Yes. I created an ordinal regression model, starting with quality as described by alcohol.density alone (m1), which reached an Aldrich-Nelson pseudo-R-squared of 0.186, i.e. about 19%.

I later added the variables volatile.acidity, residual.sugar (log-transformed), free.sulfur.dioxide, fixed.acidity, total.sulfur.dioxide and pH. The final model (m7) reaches a pseudo-R-squared of 0.254.


Final Plots and Summary

Plot One

Description One

Above is the scatter plot of the relationship between fixed acidity and pH, faceted by quality rating. It shows that most critics give better ratings when pH and acidity are neither too high nor too low.

Plot Two

Description Two

The scatter plot of alcohol content vs residual sugar shows that wines with higher alcohol content (and lower residual sugar) were more likely to receive a better quality rating.

Plot Three

Description Three

This is the distribution of residual sugar, which was right-skewed with a long tail. We performed a log transformation on it and observed a bimodal distribution, with maximum counts at around 1.5 and around 9.75.


Reflection

The White Wine data set contains information on 4,898 white wines with 11 variables that quantify the chemical properties of each wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).

As expected, higher alcohol levels in the wine resulted in the critics giving the wines a better quality grading, given that higher alcohols give an aromatic effect in wines. Wine whose fixed acidity and pH were neither too high nor too low was also more likely to get better ratings. This was in line with what I found on the Napa Valley Register website, that better wines have proper acid levels. Too much acidity and the wine would be so tart it wouldn’t be pleasing; too little acidity and the wine becomes flat, dull and unappealing with food.

However, even after creating a model to describe quality based on the chemical features (alcohol/density, volatile and fixed acidity, residual sugar (log-transformed), free and total sulfur dioxide and pH), the model only reached a pseudo-R-squared of 0.254 for the wine’s quality rating. I think more features, more data, or both would be required to better understand and predict the quality ratings assigned to white wine.